
gguf.md: Add GGUF Naming Convention Section #822

Merged: 9 commits merged into ggerganov:master on May 17, 2024

Conversation

@mofosyne (Contributor) commented May 14, 2024

#820

This PR is based on the outfile default name generation in ggerganov/llama.cpp#4858, copied from there but with the historical references and the justification for why it was designed that way removed.

Feedback and adjustments are appreciated. Any changes to this will also mean we need to update the llama.cpp default name generation.


In addition, is there any filename generation in this repo? If so, we may want to update it to use this common naming scheme as well.

@julien-c

Interesting! We could probably parse it on the HF side in the future if it makes sense and if it unlocks cool features (we already attempt to extract the quantization type from the filename, but this could make it more robust; cc @mishig25).

@Vaibhavs10

If it helps, we follow a somewhat similar (but not exhaustive) convention in the gguf-my-repo quantisation space: https://huggingface.co/spaces/ggml-org/gguf-my-repo/blob/main/app.py#L67

Standardisation of file names is always a great move!

@mofosyne (Contributor, Author) commented May 14, 2024

> If it helps, we follow a somewhat similar (but not exhaustive) convention in the gguf-my-repo quantisation space: https://huggingface.co/spaces/ggml-org/gguf-my-repo/blob/main/app.py#L67
>
> Standardisation of file names is always a great move!

Had a quick look. Do you mean your current naming arrangement is something like:

  • `<model_name>/<model_name.lower()>.<q_method.upper()>.gguf`, so something like TinyStories/tinystories-F16.gguf

If so, do you have a preferred form? I arrived at this form basically by casual observation of the typical naming schemes on Hugging Face, hence `<Model>-<Version>-<ExpertsCount>x<Parameters>-<Quantization>.gguf` (where the version field is optional).

But obviously it's research by vibes, so it would be better if I had some feedback, especially from those who would be forced to try and parse such files. Ergo, @JidongZhang-THU, would it make things easier for you if we made 'version' not optional? (The experts count being optional is okay, as it's easy to tell whether the x is there or not.)

And of course... did I miss anything that would be useful for people parsing model file names?
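
To make the proposed convention concrete, here is a minimal sketch (a hypothetical helper, not code from this PR or from llama.cpp) of how such a filename could be assembled, treating the version and experts-count fields as optional:

```python
from typing import Optional

# Hypothetical helper: build a name following
# <Model>-<Version>-<ExpertsCount>x<Parameters>-<Quantization>.gguf,
# where version and experts_count may be omitted.
def gguf_filename(model: str, parameters: str, quantization: str,
                  version: Optional[str] = None,
                  experts_count: Optional[int] = None) -> str:
    parts = [model]
    if version is not None:
        parts.append(version)
    if experts_count is not None:
        parts.append(f"{experts_count}x{parameters}")
    else:
        parts.append(parameters)
    parts.append(quantization)
    return "-".join(parts) + ".gguf"

print(gguf_filename("Mixtral", "7B", "Q2_K", version="v0.1", experts_count=8))
# -> Mixtral-v0.1-8x7B-Q2_K.gguf
```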

@Vaibhavs10

Let me add a bit of backstory as to why we chose this naming scheme (which I'm more than happy to change):

A typical user of the quantisation space would want to create quants for an arbitrary model on the Hub.
To preserve the info about the model itself, we copy the model-id (which can be different from the model name; for example, there are 4.5k+ Mixtral models on the Hub, all with different names based on their training set and other metadata) and then append the information about the quant type.

The model name typically already has the information about expert count and parameters.

I'm open to ideas to align better, I just thought I'd provide more context.

@mofosyne force-pushed the 820-naming-convention-for-gguf branch 3 times, most recently from 1a02aee to 43e8e45 (May 14, 2024 13:45)
@mofosyne force-pushed the 820-naming-convention-for-gguf branch from 43e8e45 to 1bf1ab5 (May 14, 2024 13:49)
@mofosyne (Contributor, Author)

@julien-c do you have a preference when it comes to parsing filenames? I'm basically treating it as a sort of dash-separated value (in which case, I should probably make the version field mandatory).

@julien-c

@mofosyne no, we'll adapt!

@mofosyne force-pushed the 820-naming-convention-for-gguf branch from b489520 to 7d9bd43 (May 15, 2024 07:41)
@mofosyne (Contributor, Author)

Thanks for the historical context. I might have gotten a bit carried away here, but I've ended up mapping each enum name to the tensor type description and the historical context behind each PR that relates to its initial inclusion...

Not even sure if it's allowed on this gguf.md page, so I'm just attaching it to this comment in case I should remove it. But hopefully it helps provide a general overview of each gguf file type.

Oh, and I've updated the page a bit. Opted for 'tensor type' rather than 'file name', as that appears to make more sense, to me at least.

| TensorType enum | ggml_ftype | Tensor Type Description | Initial commit or PR in llama.cpp (historical context) |
| --- | --- | --- | --- |
| F32 | GGML_FTYPE_ALL_F32 | 32-bit IEEE 754 | llama.cpp CM |
| F16 | GGML_FTYPE_MOSTLY_F16 | 16-bit IEEE 754 | llama.cpp CM |
| Q4_0 | GGML_FTYPE_MOSTLY_Q4_0 | 4-bit quant (scaling only) | llama.cpp CM |
| Q4_1 | GGML_FTYPE_MOSTLY_Q4_1 | 4-bit quant (scaling plus offset) | llama.cpp CM |
| Q4_1_F16 | GGML_FTYPE_MOSTLY_Q4_1_SOME_F16 | 4-bit quant (scaling plus offset), except for tok_embeddings and output weights which are F16 | llama.cpp CM |
| Q8_0 | GGML_FTYPE_MOSTLY_Q8_0 | 8-bit quant (scaling only) | llama.cpp CM |
| Q5_0 | GGML_FTYPE_MOSTLY_Q5_0 | 5-bit quant (scaling only) | llama.cpp CM |
| Q5_1 | GGML_FTYPE_MOSTLY_Q5_1 | 5-bit quant (scaling plus offset) | llama.cpp CM |
| kQ2 | GGML_FTYPE_MOSTLY_Q2_K | 2-bit k-quant (SOTA) | llama.cpp PR |
| kQ3 | GGML_FTYPE_MOSTLY_Q3_K | 3-bit k-quant (SOTA) | llama.cpp PR |
| kQ4 | GGML_FTYPE_MOSTLY_Q4_K | 4-bit k-quant (SOTA) | llama.cpp PR |
| kQ5 | GGML_FTYPE_MOSTLY_Q5_K | 5-bit k-quant (SOTA) | llama.cpp PR |
| kQ6 | GGML_FTYPE_MOSTLY_Q6_K | 6-bit k-quant (SOTA) | llama.cpp PR |
| iQ2_XXS | GGML_FTYPE_MOSTLY_IQ2_XXS | 2.0625 bits-per-weight quant (SOTA) | llama.cpp PR |
| iQ2_XS | GGML_FTYPE_MOSTLY_IQ2_XS | 2.31 bits-per-weight quant (SOTA) | llama.cpp PR |
| iQ3_XXS | GGML_FTYPE_MOSTLY_IQ3_XXS | 3.0625 bits-per-weight quant (SOTA) | llama.cpp PR |
| iQ1_S | GGML_FTYPE_MOSTLY_IQ1_S | 1.5 bits-per-weight quant | llama.cpp PR |
| iQ4_NL | GGML_FTYPE_MOSTLY_IQ4_NL | 4-bit non-linear quant with blocks of 32 | llama.cpp PR |
| iQ3_S | GGML_FTYPE_MOSTLY_IQ3_S | 3.4375 bits-per-weight quant | llama.cpp PR |
| iQ2_S | GGML_FTYPE_MOSTLY_IQ2_S | 2 to 3 bits-per-weight quant | llama.cpp PR |
| iQ4_XS | GGML_FTYPE_MOSTLY_IQ4_XS | 4.25 bits-per-weight quant | llama.cpp PR |
| iQ1_M | GGML_FTYPE_MOSTLY_IQ1_M | 1.75 bits-per-weight quant | llama.cpp PR |
| BF16 | GGML_FTYPE_MOSTLY_BF16 | bfloat16 (truncated 32-bit IEEE 754) | llama.cpp PR |

@mishig25 commented May 15, 2024

@mofosyne, I've made a similar table of quant descriptions in https://huggingface.co/docs/hub/gguf#quantization-types (sharing just in case there's any useful info)

[screenshot: quantization types table from the Hub docs]

@mofosyne (Contributor, Author) commented May 15, 2024

@mishig25 thanks. I decided to cross-reference your table with what I have, and this is the breakdown I was able to figure out. I'm not 100% sure about all the superblock configurations for the i-quantizations based on your statement and the llama.cpp PR descriptions, but I was able to extract some. I think gg would be a clearer source of truth here (especially for some of my general assertions below).

Encoding Scheme Name Table

| Scheme | ggml_ftype C enumeration name | Bits/Weight | Data Type | Block Configuration | Quantized Weight Formula | Initial Commit or Pull Request Source |
| --- | --- | --- | --- | --- | --- | --- |
| F32 | GGML_FTYPE_ALL_F32 | 32 | 32-bit IEEE 754 | - | - | llama.cpp CM |
| F16 | GGML_FTYPE_MOSTLY_F16 | 16 | 16-bit IEEE 754 | - | - | llama.cpp CM |
| Q4_0 | GGML_FTYPE_MOSTLY_Q4_0 | 4 | round to nearest quantization | Each block has 32 weights | w = q * block_scale | llama.cpp CM |
| Q4_1 | GGML_FTYPE_MOSTLY_Q4_1 | 4 | round to nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | llama.cpp CM |
| Q4_1_F16 | GGML_FTYPE_MOSTLY_Q4_1_SOME_F16 | 4 | round to nearest quantization | Each block has 32 weights (token embedding and output weights are F16) | w = q * block_scale + block_minimum | llama.cpp CM |
| Q8_0 | GGML_FTYPE_MOSTLY_Q8_0 | 8 | round to nearest quantization | Each block has 32 weights | w = q * block_scale | llama.cpp CM |
| Q5_0 | GGML_FTYPE_MOSTLY_Q5_0 | 5 | round to nearest quantization | Each block has 32 weights | w = q * block_scale | llama.cpp CM |
| Q5_1 | GGML_FTYPE_MOSTLY_Q5_1 | 5 | round to nearest quantization | Each block has 32 weights | w = q * block_scale + block_minimum | llama.cpp CM |
| KQ2 | GGML_FTYPE_MOSTLY_Q2_K | 2.5625 | k-quantization | Superblocks with 16 blocks, each block has 16 weights | w = q * block_scale (4-bit) + block_min (4-bit) | llama.cpp PR |
| KQ3 | GGML_FTYPE_MOSTLY_Q3_K | 3.4375 | k-quantization | Superblocks with 16 blocks, each block has 16 weights | w = q * block_scale (6-bit) | llama.cpp PR |
| KQ4 | GGML_FTYPE_MOSTLY_Q4_K | 4.5 | k-quantization | Superblocks with 8 blocks, each block has 32 weights | w = q * block_scale (6-bit) + block_min (6-bit) | llama.cpp PR |
| KQ5 | GGML_FTYPE_MOSTLY_Q5_K | 5.5 | k-quantization | Superblocks with 8 blocks, each block has 32 weights | w = q * block_scale (6-bit) + block_min (6-bit) | llama.cpp PR |
| KQ6 | GGML_FTYPE_MOSTLY_Q6_K | 6.5625 | k-quantization | Superblocks with 16 blocks, each block has 16 weights | w = q * block_scale (8-bit) | llama.cpp PR |
| IQ2_XXS | GGML_FTYPE_MOSTLY_IQ2_XXS | 2.0625 | i-quantization | Superblocks with 8 blocks, each block has 32 weights | w = func(superblock_scale, importance_matrix) | llama.cpp PR |
| IQ2_XS | GGML_FTYPE_MOSTLY_IQ2_XS | 2.31 | i-quantization | Superblocks with 16 blocks, each block has 16 weights | w = func(superblock_scale, importance_matrix) | llama.cpp PR |
| IQ3_XXS | GGML_FTYPE_MOSTLY_IQ3_XXS | 3.0625 | i-quantization | Superblocks with 8 blocks, each block has 32 weights | w = func(superblock_scale, importance_matrix) | llama.cpp PR |
| IQ1_S | GGML_FTYPE_MOSTLY_IQ1_S | 1.5 | i-quantization | Superblocks with 8 blocks, each block has 32 weights | w = func(superblock_scale, importance_matrix) | llama.cpp PR |
| IQ4_NL | GGML_FTYPE_MOSTLY_IQ4_NL | 4.5 | i-quantization | Superblocks with 16 blocks, each block has 16 weights | w = [non-linear mapping of quants to weights] | llama.cpp PR |
| IQ3_S | GGML_FTYPE_MOSTLY_IQ3_S | 3.4375 | i-quantization | ? | w = func(superblock_scale, importance_matrix) | llama.cpp PR |
| IQ2_S | GGML_FTYPE_MOSTLY_IQ2_S | 2.5 | i-quantization | ? | w = func(superblock_scale, importance_matrix) | llama.cpp PR |
| IQ4_XS | GGML_FTYPE_MOSTLY_IQ4_XS | 4.25 | i-quantization | Superblocks with 8 blocks, each block has 32 weights | w = func(superblock_scale, importance_matrix) | llama.cpp PR |
| IQ1_M | GGML_FTYPE_MOSTLY_IQ1_M | 1.75 | i-quantization | Superblocks with 16 blocks, each block has 16 weights | w = func(superblock_scale, importance_matrix) | llama.cpp PR |
| BF16 | GGML_FTYPE_MOSTLY_BF16 | 16 | bfloat16 (truncated 32-bit IEEE 754) | - | - | llama.cpp PR |
  • All superblocks have an fp16 scaling factor and contain up to 256 weights. The number of weights in a block must be divisible by 256.
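
For intuition about the "round to nearest" rows above, here is a simplified sketch (illustrative only, not llama.cpp's actual implementation) of a Q8_0-style block quantizer: blocks of 32 weights, one fp16 scale per block, and w = q * block_scale on dequantization:

```python
import numpy as np

BLOCK_SIZE = 32  # weights per block, as listed in the table above

def quantize_q8_0(weights: np.ndarray):
    """Quantize a 1-D float array whose length is a multiple of BLOCK_SIZE."""
    blocks = weights.reshape(-1, BLOCK_SIZE)
    # One scale per block, chosen so the largest magnitude maps to 127.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0.0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # w = q * block_scale
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q8_0(w)
print(np.max(np.abs(w - dequantize_q8_0(q, s))))  # small reconstruction error
```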

@mofosyne requested review from ggerganov and Green-Sky on May 16, 2024 01:28
@mofosyne (Contributor, Author) commented May 16, 2024

@mishig25 when you made your table, were you able to figure out the superblock makeup and how to represent the weight formulae (in general)? Also, in your opinion, is this table in the right location, or should it be split up (and if so, where)?

(And on a meta note... how much information should we really expose in this document? Too much can confuse developers.)


edit: Justine T also mentioned, regarding my 'Weights Encoding Scheme' table, that I may have issues using different names for quants than what the software (presumably llama.cpp) uses. So I guess we could say this is not a super hard and fast mapping and can include other variants... but for the context of ggml this is the base scheme name. llama.cpp can then define its own extra naming (e.g. _S, _M and _L) in its own documentation (as extra pointers for users about what to expect).

@ggerganov (Owner)

> (And on a meta note... how much information should we really expose in this document? Too much can confuse developers.)

IMO, the entire encoding section should just be reduced to simply:

> Indicates the weights encoding scheme that was applied to the model. Content, type mixture and arrangement however are determined by user code and can vary depending on project needs.

The rest of the information is specific mainly to llama.cpp and not relevant to the GGUF format.

@mofosyne (Contributor, Author)

@ggerganov thanks, looks much more compact and focused now.

@mishig25 @julien-c @Vaibhavs10 @Green-Sky let's lock this in?

(I wonder if the table with bits, datatype, block config, etc. would be useful anywhere, such as in the llama.cpp documentation, and if so, in which specific location.)

@ggerganov merged commit 9988298 into ggerganov:master on May 17, 2024
@mofosyne (Contributor, Author)

@ggerganov thanks for the merge.

I've decided to place the table at https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes. It turns out the GitHub wiki kind of sucks at rendering tables, but I hope it's of help to everyone here.

@mofosyne deleted the 820-naming-convention-for-gguf branch on May 17, 2024 06:50
@@ -18,6 +18,43 @@ GGUF is a format based on the existing GGJT, but makes a few changes to the form

The key difference between GGJT and GGUF is the use of a key-value structure for the hyperparameters (now referred to as metadata), rather than a list of untyped values. This allows for new metadata to be added without breaking compatibility with existing models, and to annotate the model with additional information that may be useful for inference or for identifying the model.

### GGUF Naming Convention

GGUF follow a naming convention of `<Model>-<Version>-<ExpertsCount>x<Parameters>-<EncodingScheme>.gguf`

@mofosyne great work!

Maybe the only missing piece of information is the optional suffix carrying the shard info.
Example: "grok-1/grok-1-q4_0-00003-of-00009.gguf"
